Introduction

The goal of this project is to create a machine learning model that can predict whether a patient will die of heart failure based on patient history and vital signs.

What is Heart Failure

Although heart failure sounds like the heart has stopped, this is not the case. Heart failure, also known as congestive heart failure, is a serious, incurable condition in which the heart does not work properly and fails to pump enough blood to meet the body’s needs. Heart failure may occur if the heart can’t fill up with enough blood or if the heart is simply too weak to pump properly.

According to the Centers for Disease Control and Prevention, more than 6 million adults in the United States suffer from heart failure.

According to the National Heart, Lung, and Blood Institute (NHLBI), “Heart failure may not cause symptoms right away. But eventually, you may feel tired and short of breath and notice fluid buildup in your lower body, around your stomach, or your neck.” Heart failure can also eventually cause damage to other organs such as the liver or kidneys and lead to other conditions such as pulmonary hypertension, heart valve disease, and sudden cardiac arrest.

Although heart failure is incurable, the Mayo Clinic states that “Proper treatment can improve the signs and symptoms of heart failure and may help some people live longer,” and that “Lifestyle changes - such as losing weight, exercising, and managing stress - can improve your quality of life” (Staff 2021).

Why Predict Death by Heart Failure

Although heart failure may be incurable, it could still be beneficial for medical professionals to predict whether a patient may develop and potentially die from heart failure. For example, if a doctor can determine with high probability that a patient may develop heart failure later in life, they may be able to inform the patient so that they can make lifestyle changes early enough to prevent the most significant symptoms.

Additionally, the body initially tries to mask the problem of heart failure through mechanisms such as enlarging the heart, developing more muscle mass, or pumping faster. These solutions are all temporary, and heart failure will simply progress until the onset of more serious symptoms such as fatigue or breathing problems. Since treatment can often slow the progression of heart failure, a machine learning model that could successfully predict a person’s chances of dying from heart failure could increase early detection, so that more cases are caught early and the progression of the disease is slowed.

Since the data set I will use includes deaths as a result of heart failure, creating an effective machine learning model out of this data set would also allow doctors to preemptively begin treatment that may prevent the patient from dying due to heart failure.

About the Data set

This data set was assembled as part of a study of heart failure patients who were admitted to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December 2015 (Ahmad T 2017). All patients in the study had left-ventricular systolic dysfunction, meaning that the left ventricle was unable to contract vigorously, which indicates a pumping problem (Staff 2021). Furthermore, all patients in the study fell into New York Heart Association (NYHA) Functional Classification levels III and IV.

Project Outline

Exploratory Data Analysis

Loading in Packages and Data

We begin by loading the packages we will use for the project and reading the raw heart failure data into the variable heartfailure_data.

# Loading in libraries we will be using 
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(knitr)
library(corrplot)
library(ggthemes)
library(gt)
library(gtExtras)
library(visdat)
library(fastDummies)
tidymodels_prefer()
# Read raw data into a data frame. 
heartfailure_data <- read_csv("heart_failure_clinical_records_dataset.csv")

head(heartfailure_data) %>%
  gt() %>%
  gt_theme_nytimes() %>%
  tab_header("Heart Failure Data") 
Heart Failure Data
age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT
75   0        582                       0         20                 1                    265000     1.9               130           1    0        4     1
55   0        7861                      0         38                 0                    263358     1.1               136           1    0        6     1
65   0        146                       0         20                 0                    162000     1.3               129           1    1        7     1
50   1        111                       0         20                 0                    210000     1.9               137           1    0        7     1
65   1        160                       1         20                 0                    327000     2.7               116           0    0        8     1
90   1        47                        0         40                 1                    204000     2.1               132           1    1        8     1

The data was obtained from the Kaggle data set “Heart Failure Prediction”, with the original data being from a study conducted by Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza.

Tidying Our Data

We can now look at some basic information about the size of our data set:

dim(heartfailure_data)
## [1] 299  13

We can see that our data set has 299 observations and 13 variables. Let us now take a look at a summary of our variables:

vis_dat(heartfailure_data)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
##   Please report the issue at <https://github.com/ropensci/visdat/issues>.
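
As a numeric cross-check of what the vis_dat() plot shows, missing values and column types can also be inspected directly in base R. The sketch below is illustrative only: sample_rows is a hypothetical stand-in built from three columns of the six rows printed by head() above, not the full data set.

```r
# Hypothetical stand-in: three columns from the six rows printed by head() above
sample_rows <- data.frame(
  age = c(75, 55, 65, 50, 65, 90),
  anaemia = c(0, 0, 0, 1, 1, 1),
  serum_creatinine = c(1.9, 1.1, 1.3, 1.9, 2.7, 2.1)
)

# Count missing values per column; all zeros means nothing is missing
missing_per_column <- colSums(is.na(sample_rows))
print(missing_per_column)

# Report the storage class of each column
column_classes <- sapply(sample_rows, class)
print(column_classes)
```

Running the same two calls on heartfailure_data should give one count and one class per variable.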

We can see that our data does not include any missing values, so that is not something we need to worry about. We also see that every variable is stored as numeric, even though anaemia, diabetes, high_blood_pressure, sex, smoking, and DEATH_EVENT are binary, so we will convert those variables to factors.

heartfailure_data$anaemia <- as.factor(heartfailure_data$anaemia)
heartfailure_data$diabetes <- as.factor(heartfailure_data$diabetes)
heartfailure_data$high_blood_pressure <- as.factor(heartfailure_data$high_blood_pressure)
heartfailure_data$sex <- as.factor(heartfailure_data$sex)
heartfailure_data$smoking <- as.factor(heartfailure_data$smoking)
heartfailure_data$DEATH_EVENT <- as.factor(heartfailure_data$DEATH_EVENT)
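
The six assignments above could also be written as a single step with dplyr’s across(). This is only a sketch of an equivalent alternative, not a change to the approach used in this project; sample_rows and binary_cols are hypothetical names, and the values are taken from the first three rows printed by head() earlier.

```r
library(dplyr)

# Names of the binary columns that should become factors
binary_cols <- c("anaemia", "diabetes", "high_blood_pressure",
                 "sex", "smoking", "DEATH_EVENT")

# Hypothetical stand-in: first three rows printed by head(), selected columns
sample_rows <- tibble(
  age = c(75, 55, 65),
  anaemia = c(0, 0, 0),
  diabetes = c(0, 0, 0),
  high_blood_pressure = c(1, 0, 0),
  sex = c(1, 1, 1),
  smoking = c(0, 0, 1),
  DEATH_EVENT = c(1, 1, 1)
)

# Convert every binary column to a factor in one call; age stays numeric
converted <- sample_rows %>%
  mutate(across(all_of(binary_cols), as.factor))

print(sapply(converted, class))
```

Using all_of() makes the call fail loudly if one of the listed column names is misspelled, which the repeated assignments would not do.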

When looking at information about the original data set, I also noticed that time indicates either the number of days until the patient died or the number of days until the patient was censored, which in this case simply means that they did not die. Because of this, I decided that the variable would be hard to interpret and would not be a meaningful predictor of whether the patient died of heart failure, so I will remove it from the data set used for the machine learning models.

heartfailure_data <- heartfailure_data %>% select(-time)

Variable Breakdown

Now we are left with the following variables, which will be used for the machine learning model:

- age: The age of the patient.
- anaemia: 1 if the patient was anemic (haematocrit level lower than 36%), 0 if the patient was not anemic.
- creatinine_phosphokinase: The amount of creatine phosphokinase (CPK) in the blood. CPK is often released into the blood when muscle tissue gets damaged.
- diabetes: 1 if the patient has diabetes, 0 if the patient does not have diabetes.
- ejection_fraction: The percentage of blood the left ventricle pumps out with each contraction.
- high_blood_pressure: 1 if the patient has high blood pressure, 0 if the patient does not.
- platelets: Result of a platelet count, which measures the number of platelets in the blood.
- serum_creatinine: Creatinine level in the blood. High serum creatinine levels indicate that the kidneys may not be functioning properly (Roth 2019).
- serum_sodium: Result of a blood sodium test. Low serum sodium levels may be an indicator of heart failure (Case-Lo 2018).
- sex: 1 if the patient is male, 0 if the patient is female.
- smoking: 1 if the patient smokes, 0 if the patient does not smoke.
- DEATH_EVENT: 1 if the patient died during the course of the study, 0 if the patient did not die during the course of the study.

Visual EDA

Heart Failure Deaths Distribution

We will first look at the distribution of heart failure deaths:

# Bar chart of outcome counts; the fill is a fixed color, so no legend is drawn
ggplot(heartfailure_data, aes(x = DEATH_EVENT)) +
  geom_bar(fill = "#69b3a2") +
  labs(title = "Distribution of Heart Failure Deaths", x = "Death Event", y = "Count")

From the bar chart, we see that most of the patients did not die during the study. Of the 299 patients, 32.1% died and 67.9% did not.
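
The percentages can be checked with table() and prop.table(). The sketch below assumes counts of 96 deaths and 203 survivors. These counts are not stated in the text; they are inferred as the only split of 299 patients that rounds to 32.1% and 67.9%. The vector death_event is a hypothetical stand-in for heartfailure_data$DEATH_EVENT.

```r
# Assumed counts (inferred from the quoted percentages): 203 survivors, 96 deaths
death_event <- factor(c(rep(0, 203), rep(1, 96)))

# Tabulate the outcome and convert the counts to percentages
class_counts <- table(death_event)
class_percent <- round(100 * prop.table(class_counts), 1)

print(class_counts)
print(class_percent)
```

An imbalance of roughly two survivors for every death is worth keeping in mind later, for example by stratifying the train/test split on DEATH_EVENT.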

Variable Correlation Plot

Age

# Density of patient age, split by outcome (ggplot2 layers are combined with +, not %>%)
ggplot(data = heartfailure_data, aes(x = age, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_density(adjust = 1.5, alpha = .4) +
  scale_fill_manual(labels = c("Patient Did Not Die", "Patient Died"), values = c("lightblue", "pink")) +
  labs(title = "Distribution of Patients who Lived/Died during Study", x = "Age", y = "Density")

# Histogram of patient age, split by outcome
ggplot(data = heartfailure_data, aes(x = age, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_histogram(bins = 39) +
  scale_fill_manual(labels = c("Patient Did Not Die", "Patient Died"), values = c("lightblue", "pink")) +
  labs(title = "Distribution of Patients who Lived/Died during Study", x = "Age", y = "Count")

References

Ahmad T, Bhatti SH, Munir A. 2017. “Survival Analysis of Heart Failure Patients: A Case Study.” PLOS ONE. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0181001.
Case-Lo, Christine. 2018. “Blood Sodium Test.” https://www.healthline.com/health/sodium-blood.
Roth, Erica. 2019. “Creatinine Blood Test.” https://www.healthline.com/health/creatinine-blood#procedure.
Staff, Mayo Clinic. 2021. “Heart Failure.” https://www.mayoclinic.org/diseases-conditions/heart-failure/symptoms-causes/syc-20373142.